Logarithmic Regret for Episodic Continuous-Time Linear-Quadratic Reinforcement Learning Over a Finite-Time Horizon

Abstract

We study finite-time horizon continuous-time linear-quadratic reinforcement learning problems in an episodic setting, where both the state and control coefficients are unknown to the controller. We first propose a least-squares algorithm based on observations and controls, and establish a logarithmic regret bound of order $O((\ln M)(\ln\ln M))$, with $M$ being the number of episodes. The analysis consists of two parts: perturbation analysis, which exploits the regularity and robustness of the associated Riccati differential equation; and parameter estimation error, which relies on sub-exponential properties of the least-squares estimators. We further propose a practically implementable least-squares algorithm based on discrete-time observations and piecewise constant controls, which achieves a similar logarithmic regret bound with an additional term depending explicitly on the time stepsizes used in the algorithm.
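As a rough illustration of the kind of discrete-time least-squares estimation the abstract refers to, the Python sketch below regresses observed state increments on the state and a piecewise constant control. It is not the authors' algorithm: the example matrices `A_true` and `B_true`, the unit-variance noise, the Gaussian exploratory controls, the Euler-Maruyama simulator, and all function names are assumptions made for illustration only.

```python
import numpy as np

def simulate_episode(A, B, x0, T, dt, rng):
    """Euler-Maruyama simulation of dX = (A X + B u) dt + dW.

    The control is held constant on each step of length dt.
    Gaussian exploratory controls are an assumption of this sketch.
    """
    n_steps = int(round(T / dt))
    n, d = B.shape
    X = np.zeros((n_steps + 1, n))
    U = rng.standard_normal((n_steps, d))
    X[0] = x0
    for k in range(n_steps):
        drift = A @ X[k] + B @ U[k]
        X[k + 1] = X[k] + drift * dt + np.sqrt(dt) * rng.standard_normal(n)
    return X, U

def least_squares_estimate(episodes, dt):
    """Estimate [A B] by regressing increments X[k+1] - X[k] on Z[k] = (X[k], u[k])."""
    X0, U0 = episodes[0]
    p = X0.shape[1] + U0.shape[1]
    V = np.zeros((p, p))            # accumulated sum of Z Z^T dt
    Y = np.zeros((p, X0.shape[1]))  # accumulated sum of Z dX^T
    for X, U in episodes:
        Z = np.hstack([X[:-1], U])
        dX = X[1:] - X[:-1]
        V += Z.T @ Z * dt
        Y += Z.T @ dX
    theta = np.linalg.solve(V, Y)   # shape (n + d, n)
    return theta.T                  # shape (n, n + d), i.e. [A B]

# Hypothetical two-dimensional system with a scalar control.
A_true = np.array([[0.0, 1.0], [-1.0, -0.5]])
B_true = np.array([[0.0], [1.0]])
T, dt, M = 1.0, 0.01, 1000          # horizon, time stepsize, number of episodes

rng = np.random.default_rng(0)
episodes = [simulate_episode(A_true, B_true, np.zeros(2), T, dt, rng)
            for _ in range(M)]

theta_hat = least_squares_estimate(episodes, dt)
A_hat, B_hat = theta_hat[:, :2], theta_hat[:, 2:]
print("max |A_hat - A_true| =", np.max(np.abs(A_hat - A_true)))
print("max |B_hat - B_true| =", np.max(np.abs(B_hat - B_true)))
```

In this sketch the data are generated by the same discrete scheme that the regression assumes, so no discretization bias appears; with genuinely continuous-time dynamics one would expect an extra error that depends on the stepsize `dt`, in line with the additional term the abstract mentions.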

Similar resources

Two-warehouse system for non-instantaneous deterioration products with promotional effort and inflation over a finite time horizon

In the current global market, organizations use many promotional tools to increase their sales. One such tool is sales teams’ initiatives or promotional policies, e.g., free gifts, discounts, packaging, etc. This phenomenon motivates the retailer or buyer to order a large inventory lot so as to take full benefit of promotional policies. In view of this, the present paper considers a two-warehous...

Iteratively Extending Time Horizon Reinforcement Learning

Reinforcement learning aims to determine an (infinite time horizon) optimal control policy from interaction with a system. It can be solved by approximating the so-called Q-function from a sample of four-tuples $(x_t, u_t, r_t, x_{t+1})$, where $x_t$ denotes the system state at time $t$, $u_t$ the control action taken, $r_t$ the instantaneous reward obtained, and $x_{t+1}$ the successor state of the system, and by deter...

Inventory Model for Non – Instantaneous Deteriorating Items, Stock Dependent Demand, Partial Backlogging, and Inflation over a Finite Time Horizon

In the present study, the Economic Order Quantity (EOQ) model of two-warehouse deals with non-instantaneous deteriorating items, the demand rate considered as stock dependent and model affected by inflation under the pattern of time value of money over a finite planning horizon. Shortages are allowed and partially backordered depending on the waiting time for the next replenishment. The main ob...

Continuous-Time Hierarchical Reinforcement Learning

Hierarchical reinforcement learning (RL) is a general framework which studies how to exploit the structure of actions and tasks to accelerate policy learning in large domains. Prior work in hierarchical RL, such as the MAXQ method, has been limited to the discrete-time discounted reward semi-Markov decision process (SMDP) model. This paper generalizes the MAXQ method to continuous-time discounte...

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic on...

Journal

Journal title: Social Science Research Network

Year: 2021

ISSN: 1556-5068

DOI: https://doi.org/10.2139/ssrn.3848428